ABSTRACT

The data entry quality control procedures in discrete 
data entry tasks in the National Longitudinal Study (NLS) Fourth 
Follow-up Survey are examined. Direct data entry terminals were used 
to key survey questionnaire item responses, telephone interview 
corrections, respondent background information and supplemental 
questionnaire responses into computer disk storage. Data entry error 
rates were computed on the survey questionnaires by selecting a 
random sample from each batch after initial keying of the data, 
rekeying the selected questionnaires by two additional operators and 
determining error rates on the basis of three keyings. In the
implementation described, the overall error rate tolerance
established for the NLS survey was not exceeded. The variable error
rate over samples and operators on the selected supplemental
questionnaires was 0.00040; the estimated character error rate was
0.00023. The telephone interview additions and corrections, and
directory information entry procedures are described. (CM) 
SUMMARY

The NLS Fourth Follow-up data collection activities began in October 1979
and were completed by May 1980. Data collected were coded, edited, and keyed
directly into computer disk storage by operators through programmable direct
data entry terminals, as in previous follow-up surveys. Several discrete data
entry tasks were involved (follow-up questionnaire item responses and directory
information; telephone interview forms; and Supplemental Questionnaires), and
this report describes the data entry quality control procedures implemented
for these specific tasks. Data entry errors for fourth follow-up keying
operations are estimated to be less than two in one thousand.
I. INTRODUCTION

Fourth Follow-Up Questionnaire data were keyed directly by operators into
computer disk storage through programmable direct data entry terminals. There
are several advantages to direct data entry versus standard keypunch operations,
the primary advantage being the ability to perform certain data checks at the
time of entry. Direct data entry also eliminates the need for most manual
coding of data as well as the rekey verification required in the standard
keypunch-verify approach to recording and transmitting data. Lower error rates
also result from direct data entry.

The NLS fourth follow-up survey included several data entry tasks, i.e.,
Fourth Follow-Up Questionnaire item responses, Fourth Follow-Up Questionnaire
telephone interview corrections, respondent background information, and
Supplemental Questionnaire responses. The data entry quality control procedures
for each of these tasks will be discussed in the following sections.

II. FOURTH FOLLOW-UP AND SUPPLEMENTAL QUESTIONNAIRE DATA ENTRY

In the first NLS follow-up, the overall data entry error rate was determined
by sight-verification of a random sample of keyed questionnaire data
versus the original hardcopy item responses. Probable biases in error rate
calculations using this procedure were due to oversights and fatigue, common
problems in the visual comparison of data. To eliminate biases introduced by
these inaccuracies, a computer-matching procedure for determining error rates
was developed for use in future follow-up surveys. As in second and third
follow-up data entry, this procedure was used in calculating error rates for
Fourth Follow-Up Questionnaire item response data entry and Supplemental
Questionnaire keying. The basic steps in computing error rates for these two
data entry tasks are described below.


A. Procedure 

1. General

Completed Fourth Follow-Up Questionnaires and Supplemental Questionnaires
were separately batched on receipt and routed to direct data entry
following initial editing and code assignment. The basic procedure for
estimating the data entry error rate for both of these NLS instruments was as
follows:

(a) A simple random sample of questionnaires was selected from each
batch after initial keying of the data.

(b) The selected questionnaires were rekeyed by two additional operators.

(c) Error rates were determined on the basis of computer matching of the
three separate keyings (original and two rekeys).

2. Sampling 

By mutual agreement, three questionnaires from each batch of 50 were
to be selected for rekey, for a targeted sampling rate of six percent. An
automated sampling routine designed to select, at the time of data entry, this
six percent sample was implemented at the start of data entry activity.
Although not immediately recognized, problems were encountered in computer
sampling (machine problems as well as inconsistencies in code) such that in
many cases fewer than three questionnaires per batch were automatically selected.
Consequently, a manual sampling procedure (using a table of random numbers)
was subsequently employed to ensure that exactly three instruments from each
batch were selected. Since the exact manual sampling procedure was implemented
several weeks after keying began, the realized sampling rate for Fourth Follow-Up
Questionnaire data entry quality control was approximately five percent,1/
which still provided good overall estimates as well as sufficient continued
monitoring of the quality of the keying operation. A total of 922 sets of
triplicate Fourth Follow-Up Questionnaires and 272 sets of triplicate
Supplemental Questionnaires were selected in this manner.
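
The per-batch selection can be sketched in Python as follows. This is only an illustration: the report's manual procedure used a printed table of random numbers, for which a seeded pseudorandom generator stands in here, and the function name, ID numbers, and seed are hypothetical.

```python
import random


def select_rekey_sample(batch_ids, k=3, seed=None):
    """Draw k questionnaire IDs without replacement from one batch.

    Stand-in for the manual table-of-random-numbers procedure; a batch
    holding fewer than k instruments is returned in full.
    """
    rng = random.Random(seed)
    ids = list(batch_ids)
    k = min(k, len(ids))
    return sorted(rng.sample(ids, k))


# Hypothetical full batch of 50 ID numbers (not taken from the report).
batch = list(range(1001, 1051))
sample = select_rekey_sample(batch, seed=1979)
sampling_rate = len(sample) / len(batch)  # 3 of 50, the six percent target
```

Selecting exactly three instruments from every full batch reproduces the six percent target; any batch that yields fewer than three selections pulls the realized rate below it, which is the effect described above for the automated routine.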

3. Error Model

To estimate the error rate for original keying, let $\varepsilon_1$, $\varepsilon_2$, and $\varepsilon_3$
be the probability of a keying error for the initial data entry operator, the
first rekey operator, and the second rekey operator, respectively. (It is not
assumed that $\varepsilon_1 = \varepsilon_2 = \varepsilon_3$.) Let $N$ denote the number of elements (either
single key-stroke characters or groups of characters defining a particular
questionnaire item) involved in the records used for quality check. These $N$
elements were independently keyed by the three operators. Thus, assume that
the errors made by data entry operators are independent.



1/ The problems with sampling by computer were recognized before Supplemental
Questionnaire keying began. Thus, the manual sampling procedure was used from
the start of Supplemental Questionnaire data entry, resulting in a realized
sampling rate of six percent.
Further, let

$n_a$ = number of elements on which operators 1, 2, and 3 matched;

$n_b$ = number of elements on which operators 1 and 2 matched but operator 3
did not;

$n_c$ = number of elements on which operators 1 and 3 matched but operator 2
did not;

$n_d$ = number of elements on which operators 2 and 3 matched but operator 1
did not;

$n_e$ = number of elements on which no two operators matched.

Clearly, $n_a + n_b + n_c + n_d + n_e = N$. An element is assumed to be correctly
keyed only when the master or initial keying matches at least one of the two
rekeys ($n_a$, $n_b$, and $n_c$ each denote numbers of correctly keyed variables).

Let $P_i = n_i/N$ $(i = a, b, c, d, e)$ be the proportion of elements falling
into category "i"; then the expected values of these proportions, $E(P_i)$, are
given by:

$$E(P_a) = (1-\varepsilon_1)(1-\varepsilon_2)(1-\varepsilon_3)$$

$$E(P_b) = (1-\varepsilon_1)(1-\varepsilon_2)\,\varepsilon_3$$

$$E(P_c) = (1-\varepsilon_1)(1-\varepsilon_3)\,\varepsilon_2$$

$$E(P_d) = (1-\varepsilon_2)(1-\varepsilon_3)\,\varepsilon_1$$

$$E(P_e) = \varepsilon_1\varepsilon_2\varepsilon_3 + (1-\varepsilon_1)\,\varepsilon_2\varepsilon_3 + (1-\varepsilon_2)\,\varepsilon_1\varepsilon_3 + (1-\varepsilon_3)\,\varepsilon_1\varepsilon_2$$

The empirically established error rate for experienced RTI data entry operators
is less than half a percent; therefore, $\varepsilon_1$, $\varepsilon_2$, and $\varepsilon_3$ are assumed to be less
than .005. Consequently, as a first approximation, terms of the type $\varepsilon_i\varepsilon_j$ and
of higher order (i.e., $\varepsilon_i\varepsilon_j\varepsilon_k$) may be omitted, so that

$$E(P_a) \approx 1 - (\varepsilon_1 + \varepsilon_2 + \varepsilon_3)$$

$$E(P_b) \approx \varepsilon_3$$

$$E(P_c) \approx \varepsilon_2$$

$$E(P_d) \approx \varepsilon_1$$

$$E(P_e) \approx 0$$
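
As an illustration of the error model, the following Python sketch classifies each element into the five match categories and applies the first-order approximation, taking the sample proportions for categories d, c, and b as estimates of the three operators' error rates. The function names and the toy keyings are hypothetical; the report does not describe its matching program at this level of detail.

```python
from collections import Counter


def classify(k1, k2, k3):
    """Match category for one element: initial keying k1, rekeys k2 and k3."""
    if k1 == k2 == k3:
        return "a"  # all three operators matched
    if k1 == k2:
        return "b"  # operator 3 differs
    if k1 == k3:
        return "c"  # operator 2 differs
    if k2 == k3:
        return "d"  # operator 1 (initial keying) differs
    return "e"      # no two operators matched


def estimate_error_rates(initial, rekey1, rekey2):
    """Return first-order error rate estimates and the category proportions."""
    if not initial or not (len(initial) == len(rekey1) == len(rekey2)):
        raise ValueError("the three keyings must be non-empty and equal in length")
    counts = Counter(classify(*t) for t in zip(initial, rekey1, rekey2))
    n = len(initial)
    p = {c: counts.get(c, 0) / n for c in "abcde"}
    # First-order approximation: E(P_d) ~ eps1, E(P_c) ~ eps2, E(P_b) ~ eps3.
    return {"eps1": p["d"], "eps2": p["c"], "eps3": p["b"]}, p


# Toy example with four elements (hypothetical data).
rates, props = estimate_error_rates(
    ["1", "2", "3", "4"],
    ["1", "2", "9", "4"],
    ["1", "2", "9", "5"],
)
```

In the toy data the third element falls in category d (both rekeys agree against the initial keying) and the fourth in category b, so each of those proportions is one in four.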
A first approximation to the estimate can be obtained by equating the
sample quantities $P_b$, $P_c$, and $P_d$ to their approximate expectations.2/ The
standard error of the error rate estimate can be calculated by first computing
the error rate estimate, $\hat{\varepsilon}_1$, for each record and then determining the variance
of $\hat{\varepsilon}_1$ over records. Although the errors in elements within a record are
likely to be correlated with each other, the assumption of independence between
records is more tenable.
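
A minimal Python sketch of the standard error calculation follows. It assumes that every record contains the same number of elements, so that the mean of the per-record estimates equals the pooled estimate, and that records can be treated as a simple random sample; the report does not state the exact variance formula used, so the divisor choices here are assumptions.

```python
import math
import statistics


def error_rate_with_se(per_record_rates):
    """Mean of per-record error rate estimates and its standard error.

    Assumes equal-length, independent records (an assumption of this sketch).
    """
    r = len(per_record_rates)
    if r < 2:
        raise ValueError("at least two records are needed to estimate a variance")
    mean = statistics.fmean(per_record_rates)
    var = statistics.variance(per_record_rates)  # sample variance, divisor r - 1
    return mean, math.sqrt(var / r)


# Hypothetical per-record estimates for three sampled questionnaires.
mean, se = error_rate_with_se([0.0, 0.002, 0.004])
```

Working at the record level sidesteps the within-record correlation noted above, since only independence between records is required.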

4. Implementation

All completed Fourth Follow-Up Questionnaires and Supplemental
Questionnaires, returned by mail either from individual sample members or from
NLS field interviewers, were separately batched in groups of 50 or less. A
Batch Header Sheet was produced containing all ID numbers in a given batch,
and questionnaires were subsequently identified and accounted for by this
batch control form, which detailed the action on each questionnaire within the
batch.

Following initial editing and code assignment, the batches of Fourth
Follow-Up Questionnaires and Supplemental Questionnaires were assigned to the
data entry operators, who were responsible for keying all questionnaires in
their assigned batches. Six data entry task leaders randomly selected three
questionnaires per batch for quality control purposes, using the procedures
previously described. The three questionnaires selected to be rekeyed were
removed from the batch and labeled "REKEY" on the front cover to denote their
selection in the quality control sample. The NLS ID numbers for the selected
instruments were also circled on the Batch Header Sheet by the task leader.
An indicator variable identifying whether or not a particular questionnaire
was sampled was keyed into the magnetic data record, for use in constructing
the file of sample instruments for quality control purposes.

Questionnaires selected for the quality control sample were then rekeyed
by two additional operators; the data entry procedure for rekeying was
identical to the initial keying. Problems of interpretation and readability were



2/ More exact estimates of rates and their standard errors may be obtained
through maximum likelihood procedures. Since the likelihood equations are
nonlinear and computation rather complex, it was decided to use $P_d$ as the
estimator of $\varepsilon_1$, or the error rate for original keying.

3/ Some sample selection by computer was implemented at the beginning of the
data entry process.
handled identically for the rekey operation as in the initial keying,
constituting a completely "blind" rekey effort to provide more accurate estimates of
keying error.

B. Error Rates for Fourth Follow-Up Questionnaire Data

For Fourth Follow-Up Questionnaire data entry quality control purposes,
two data entry error rates were computed, one based on the number of variables
(questionnaire items) keyed and the other based on the number of individual
characters keyed (one or more per variable). For example, "040" hours would
be considered one variable consisting of the three characters "0," "4," and
"0." A total of 922 sets of triplicate questionnaires were sampled. The
triplicate records were compared variable-by-variable and character-by-character
(excluding open-ended questionnaire items) by a computer program which
identified the variables (questionnaire items) and characters (within variables)
that were not keyed in exactly the same manner. As indicated above, the master
keying of a variable or character was considered correct if matched by at
least one of the two rekeys. Single counts of the number of rekeyed variables
and characters for which neither rekey matched the initial keying were computed,
and these counts were converted to error rates by dividing by the number of
keyed variables and the number of keyed characters, respectively. The resulting
overall variable and character error rates for individual direct data entry
operators are presented in Table 1.
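
The variable-by-variable and character-by-character comparison can be sketched in Python as below. The sketch assumes that each record is a list of fixed-width field strings with the same layout in all three keyings; the function name and the example fields are hypothetical and do not reproduce the original matching program.

```python
def keying_errors(initial, rekey1, rekey2):
    """Count initial-keying errors in one record at variable and character level.

    An initial variable (or character) counts as an error only when neither
    rekey matches it. Fields are assumed fixed-width and aligned across keyings.
    Returns (variable errors, variables, character errors, characters).
    """
    var_errors = char_errors = n_chars = 0
    for f1, f2, f3 in zip(initial, rekey1, rekey2):
        n_chars += len(f1)
        if f1 != f2 and f1 != f3:
            var_errors += 1
        char_errors += sum(
            1 for c1, c2, c3 in zip(f1, f2, f3) if c1 != c2 and c1 != c3
        )
    return var_errors, len(initial), char_errors, n_chars


# Hypothetical record of three variables ("040" hours is the first).
v_err, n_vars, c_err, n_chars = keying_errors(
    ["040", "1", "27"],
    ["040", "1", "21"],
    ["048", "1", "21"],
)
variable_error_rate = v_err / n_vars
character_error_rate = c_err / n_chars
```

In this example the first variable is counted as correct because the first rekey matches it, even though the second rekey differs; only the third variable, where both rekeys disagree with the initial keying, contributes one variable error and one character error.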

From the start of fourth follow-up data entry operations,4/ computer
reports were generated at various points in the process to indicate the overall
variable and character data entry error rates. A computer listing of the
variable (questionnaire item) errors that were detected in each report was
produced simultaneously. During initial data entry activity, reports generally
were produced on a weekly basis and later on a biweekly basis as the number of
questionnaires received at RTI decreased. However, the frequency of these
quality control reports varied, depending on such factors as the number of

4/ As new operators were trained for NLS data entry, printouts of at least six
test questionnaires keyed by the new operators were manually compared with the
respective hard copy instruments by NLS project staff. The new operators were
given additional instruction/retraining as necessary before beginning
production keying.
Table 1.--Fourth Follow-Up Questionnaire variable and character error rates
by operator

NLS        Number of keyed   Number of   Operator      Number of    Operator
operator   questionnaires    variables   variable      characters   character
number     sampled a/        keyed       error rate    keyed        error rate

 1              9              6966      (illegible)     18171      0.00088
 2             81             62694      (illegible)    163539      0.00058
 3             65             50310      (illegible)    131235      0.00104
 4             24             18576      (illegible)     48456      0.00186
 5              2              1548      (illegible)      4038      0.00149
 6              2              1548      (illegible)      4038      0.00198
 7             22             17028      (illegible)     44418      0.00122
 8              3              2322      0.00000          6057      0.00000
 9             20             15480      (illegible)     40380      0.00151
10              3              2322      0.00861          6057      0.00528 b/
11             66             51084      (illegible)    133254      0.00035
12             50             38700      (illegible)    100950      0.00040
13             43             33282      (illegible)     86817      0.00046
14             36             27864      (illegible)     72684      0.00039
15             36             27864      (illegible)     72684      0.00259
16             38             29412      0.00071         76722      0.00042
17             77             59598      0.00305        155463      0.00176
18             10              7740      0.00103         20190      0.00094
19             40             30960      0.00362         80760      0.00300
20             52             40248      0.00186        104988      0.00152
21              1               774      0.01292          2019      0.00941 b/
22              8              6192      0.00113         16152      0.00093
23             47             36378      0.00443         94893      0.00349
24             75             58050      0.00053        151425      0.00038
25              6              4644      0.00409         12114      0.00256
26             50             38700      0.00173        100950      0.00135
27              7              5418      0.00055         14133      0.00042
28             12              9288      0.00603         24228      0.00417
29              7              5418      0.00129         14133      0.00092
30             13             10062      0.00020         26247      0.00011
31              7              5418      0.00757         14133      0.01465 b/
32              7              5418      0.00111         14133      0.00127
33              3              2322      0.00345          6057      0.00495

a/ Although each operator was responsible for one or more batches, the number of
sampled questionnaires is not always a multiple of three due to problems with
computer sampling discussed earlier.

b/ Although the individual operator error rate is greater than 0.00500, the
overall data entry error rate never exceeds the contractually specified
tolerance level of .5 percent (see Figure 1). Newly trained operators
10, 21, and 31 keyed NLS data for only a short period of time as indicated
by the minimal numbers of keyed questionnaires on which their error rate
calculations are based.

NOTE.--There are 774 variables and 2019 characters per Fourth Follow-Up
Questionnaire. Open-ended responses and certain variables constant across
records, e.g., project number and data entry form number, were not used in
determining error rates.



operators keying, the number of questionnaires keyed, and the use of a second 
shift of data entry operators. Interim quality control reports were generated 
as necessary for the purpose of keeping close checks on operator performance 
(e.g., when newly trained operators were first in production mode); however, 
these interim data were not used for reporting purposes.

Figure 1 presents the overall (over operators) error rate results for
variables (questionnaire items) from the eight major data entry quality control
reports for Fourth Follow-Up Questionnaire data entry. From the data, it is
evident that the 0.005 (.5 percent) overall error rate tolerance established
for the NLS survey was not exceeded at any time point. Over time the error
rates ranged from a high of 0.00188 (early in the data entry process) to a low
of 0.00046. Based on the total sample of 922 selected questionnaires, the
estimated variable error rate was 0.00163 (based on 713,628 keyed variables)
and the estimated character error rate was 0.00136 (based on 1,861,518 keyed
characters).
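
The denominators quoted above follow directly from the per-questionnaire counts given in the note to Table 1 (774 variables and 2019 characters). The short Python check below reproduces them; the implied error counts are approximations back-calculated from the rounded rates and are not figures stated in the report.

```python
QUESTIONNAIRES = 922   # sampled sets of triplicate questionnaires
VARS_PER_Q = 774       # variables per Fourth Follow-Up Questionnaire
CHARS_PER_Q = 2019     # characters per Fourth Follow-Up Questionnaire

total_variables = QUESTIONNAIRES * VARS_PER_Q
total_characters = QUESTIONNAIRES * CHARS_PER_Q

# Approximate error counts implied by the rounded rates (an inference only).
approx_variable_errors = round(0.00163 * total_variables)
approx_character_errors = round(0.00136 * total_characters)
```

Both rates sit well below the 0.005 tolerance, by a factor of roughly three for variables and somewhat more for characters.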

C. Error Rates for Supplemental Questionnaire Data 

The procedure for determining Supplemental Questionnaire data entry error
rates also consisted of selecting a six percent random sample of questionnaires
from each keyed batch and resulted in a total of 272 sets of triplicate
Supplemental Questionnaires. Errors were calculated as described above through
variable-by-variable and character-by-character comparison of the triplicate
records. The resulting Supplemental Questionnaire variable and character
error rates for the individual direct data entry operators are presented in
Table 2. Since Supplemental Questionnaire data were keyed primarily by Fourth
Follow-Up Questionnaire data entry operators and since a six percent sample of
returned instruments resulted in only 272 sets of triplicate questionnaires,
only a few interim quality control reports were generated for the purpose of
checking each operator's performance. Based on the 272 selected Supplemental
Questionnaires, the variable error rate, over samples and operators, was
0.00040 (based on 42,704 keyed variables) and the estimated character error
rate was 0.00023 (based on 102,272 keyed characters).
[Figure 1: plot of error rate (y) against computer report number, or sample
number (x), with average line y = 0.00163. Plotted values:]

Computer report   Error rate   Number of questionnaires on which
number                         error rate calculation based

1                 0.00139      273
2                 0.00188       72
3                 0.00175      255
4                 0.00085       47
5                 0.00046       89
6                 0.00074       42
7                 0.00104       36
8                 0.00057       41

The total number of records for error rate reports 1-8 does not equal the
number of records (922) for which the total error rate was calculated.
Each of the eight groups of questionnaires contained incomplete sets of
keyings for several sample instruments (e.g., the original keying and first
rekey with no second rekey present). No adjustments were made for these
cases in the eight individual reports, but many of these incomplete sets of
questionnaires were completed for purposes of computing the total error rate.
Table 2.--Supplemental Questionnaire (SQ) variable and character error
rates by operator

NLS SQ     Number of        Number of   Operator     Number of    Operator
operator   questionnaires   variables   variable     characters   character
number     keyed            keyed       error rate   keyed        error rate

1               3              471      0.00000        1128       0.00000
2              25             3925      0.00076        9400       0.00032
3              86            13502      0.00022       32336       0.00015
4              57             8949      0.00011       21432       0.00005
5              78            12246      0.00073       29328       0.00044
6              23             3611      0.00028        8648       0.00023

NOTE.--There are 157 variables and 376 characters per Supplemental Questionnaire.
As in Fourth Follow-Up Questionnaire data entry, open-ended responses and certain
variables constant across records, such as project number and data entry form
number, were not used in computing error rates.
III. FOURTH FOLLOW-UP QUESTIONNAIRE TELEPHONE INTERVIEW ADDITIONS AND CORRECTIONS



As in previous follow-up surveys, a set of "key" or critical questionnaire
items was defined for fourth follow-up. If any of these key items were
indeterminate (omitted, answered partially, or answered inconsistently), then
additional data collection procedures were implemented, consisting of attempts
to resolve such indeterminacy through a telephone interview. The identification of
indeterminacies was accomplished by a computer edit process (replacing the
manual editing process used in prior follow-up surveys), which was applied to
the set of key items once the data were keyed into machine-readable form.

As data from each questionnaire were computer-edited, a computer-generated
problem sheet containing a list of questions and corresponding responses
needing clarification or completion was produced for each questionnaire that
failed the computer-edit process. The "fail-edit" questionnaires and their
problem sheets were routed to telephone interviewers, who were responsible for
contacting sample members and clarifying discrepancies, omissions, or
inconsistencies in the questionnaire. All item corrections/resolutions were
recorded on an answer sheet that provided for correction of all "key" or
critical items, as necessary. These "fail-edit" answer sheets (with their
associated questionnaire and computer-generated problem sheets) were
resubmitted to data entry, following any required manual coding, where only the new
data recorded on the answer sheet by telephone interviewers were keyed,
transmitted, and merged with the previously keyed questionnaire responses.
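
The computer edit step can be pictured with the following Python sketch, which lists the key items that would appear on a problem sheet for one questionnaire. The item names, the consistency rules, and the function itself are hypothetical illustrations; the report does not list the key items or the edit rules, and partially answered items are not modeled separately here.

```python
def edit_check(responses, key_items):
    """Return (item, reason) pairs for key items needing telephone follow-up.

    key_items maps each key item name to a rule taking (value, responses)
    and returning True when the response is consistent.
    """
    problems = []
    for item, is_consistent in key_items.items():
        value = responses.get(item)
        if value is None or str(value).strip() == "":
            problems.append((item, "omitted"))
        elif not is_consistent(value, responses):
            problems.append((item, "inconsistent"))
    return problems


# Hypothetical key items: a work status code and weekly hours worked.
KEY_ITEMS = {
    "WORKSTAT": lambda v, r: v in {"1", "2"},
    "HOURS": lambda v, r: v.isdigit()
    and (r.get("WORKSTAT") != "2" or int(v) == 0),
}

# Status "2" (not working, in this made-up coding) conflicts with 40 hours.
problem_sheet = edit_check({"WORKSTAT": "2", "HOURS": "040"}, KEY_ITEMS)
passes_edit = edit_check({"WORKSTAT": "1", "HOURS": "040"}, KEY_ITEMS)
```

A questionnaire with an empty problem list passes the edit; any other questionnaire would be routed to the telephone interviewers together with its problem sheet.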

Since both the number of key items and the number of respondents failing
edit were small, all such additions and corrections obtained from the telephone
interview process were 100 percent verified. This verification process involved
a rekeying of data recorded on the answer sheet together with identifying
information such as batch number, NLS ID number, and a short label (8-character
mnemonic) for each questionnaire item with corrections data present. These
corrections/additions were verified by a different operator than the original
keyer, and the verifying operator corrected, during the key-verification
process, any errors found in the initial keying.
IV. FOURTH FOLLOW-UP DIRECTORY INFORMATION ENTRY

One further data entry activity was instituted to ensure additional
accuracy in keying directory information (Section G of the Fourth Follow-Up 
Questionnaire). These data were entered as a separate step after all other 
questionnaire items were keyed. This information (e.g., name and address, 
phone number, social security number, driver's license number) was 100 percent 
verified by a different operator than the original keyer. The verifying 
operator corrected any errors detected in the initial keying. 